A man was shot and killed in the Carroll Park section of West Philadelphia around noon on Tuesday, Oct. 7, 2020. The incident was the second fatal daytime shooting in Philadelphia within 24 hours: Around 1 p.m. Monday, a 21-year-old man was killed in the parking lot outside the Lowe’s store on Columbus Boulevard in South Philly. The victim was an employee at the store.
Six people were shot and killed on Monday in the city, pushing the total number of homicides in Philadelphia in 2020 past the total for all of 2019. There have been 364 reported homicides and 1,615 people shot so far in 2020. According to the Inquirer, the numbers show gun violence is occurring at the highest rate Philadelphia has seen in a decade.
For years, gun violence in Philadelphia has been an epidemic that the Police Department is battling. Building a predictive policing model is one of the attempts the Department has made to tackle this dire problem.
An experiment leveraging a predictive policing model to anticipate crime has already been conducted in Philadelphia. Starting in 2017, Temple University's Center for Security and Crime Science, housed in the Department of Criminal Justice at Temple, and the Philadelphia Police Department carried out a two-year collaboration, the Philadelphia Predictive Policing Experiment, funded by the National Institute of Justice. This research project was the first place-based, randomized experiment to study the impact of different patrol strategies on violent and property crime in predicted criminal activity areas. The experiment asked whether different but operationally realistic police responses to crime forecasts from a predictive policing software program can reduce crime. So far, the basic conclusions are that the marked car treatment showed substantial benefits for property crime (a 31% reduction in expected crime count), as well as a temporal diffusion of benefits to the subsequent 8-hour period (a 40% reduction in expected crime count). No other intervention demonstrated meaningful crime reduction. These reductions were probably not substantial enough to affect city- or district-wide property crime. Some violent crime results ran contrary to expectations, but this occurred in a context of extremely low crime counts in predicted areas, and the small grid cell size hampered statistical power.
Inspired by these previous efforts and by the present Black Lives Matter demonstrations, this analysis develops a predictive policing model for shootings in Philadelphia, with the goals of accurate prediction and robust generalizability, as well as reducing systemic racism in how gun violence is policed.
In this section, we load the necessary libraries, create plot theme and map theme options, and define helper functions for quantile breaks and average nearest neighbor distance for use in the later analysis.
# 1. load Libraries
library(sf)
library(tidyverse)
# install.packages('mapview')
library(mapview)
library(spdep)
library(caret)
library(ckanr) # for opening data APIs built on CKAN technology
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)
library(jtools) # for regression model plots
library(stargazer) # for creating table
library(broom)
library(tufte)
library(rmarkdown)
library(kableExtra)
library(tidycensus)
# new in Lab6
library(RSocrata)
library(viridis)
library(spatstat) # make kernel density map
library(raster)
library(knitr)
library(rgdal)
# 2. Identify functions
mapTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = 15,colour = "black"),
plot.subtitle=element_text(face="italic"),
plot.caption=element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.text.x = element_text(size = 14))
}
plotTheme <- function(base_size = 12) {
theme(
text = element_text( color = "black"),
plot.title = element_text(color = "darkred", size=15, face="bold"),
plot.subtitle = element_text(face="italic"),
plot.caption = element_text(hjust=0),
axis.ticks = element_blank(),
panel.background = element_blank(),
panel.grid.major = element_line("grey80", size = 0.1),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "black", fill=NA, size=2),
strip.background = element_rect(fill = "grey80", color = "white"),
strip.text = element_text(size=12),
axis.title = element_text(size=12),
axis.text = element_text(size=10),
plot.background = element_blank(),
legend.background = element_blank(),
legend.title = element_text(colour = "black", face = "italic"),
legend.text = element_text(colour = "black", face = "italic"),
strip.text.x = element_text(size = 14)
)
}
# 3. Load Quantile break functions
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]], 0),
c(.01, .2, .4, .6, .8), na.rm = TRUE))
} else if (rnd == FALSE) {
# same quantile probabilities; formatC() only controls the number formatting
as.character(formatC(quantile(df[[variable]],
c(.01, .2, .4, .6, .8), na.rm = TRUE),
digits = 3))
}
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
# Load hexadecimal color palette
palette <- c('#feedde', '#fdbe85', '#fd8d3c', '#e6550d', '#a63603')
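As a quick sanity check of what these quantile-break helpers return, here is a base-R illustration on made-up counts (the vector `x` is hypothetical, not from the data):

```r
# Hypothetical counts; quantile() with the same probabilities qBr() uses
x <- c(0, 1, 1, 2, 3, 5, 8, 13, 21, 34)
breaks <- quantile(x, probs = c(.01, .2, .4, .6, .8), na.rm = TRUE)
round(breaks, 2)
```

The five break values are non-decreasing and can feed directly into a legend.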
# for calculating average nearest neighbor distance.
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <- as.matrix(measureFrom)
measureTo_Matrix <- as.matrix(measureTo)
nn <-
get.knnx(measureTo_Matrix, measureFrom_Matrix, k)$nn.dist
output <-
as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
dplyr::summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull() # pull() extracts a single column as a vector; like $, but pipe-friendly
return(output)
}
1. Base Map Datasets
Philadelphia Boundary: A base and outline of the City of Philadelphia.
Philadelphia Neighborhoods: A base and outline of the neighborhoods of Philadelphia.
Police Districts: A GeoJSON file containing police district polygon information.
2. Crime Data: This analysis focuses on shooting victim cases, a dataset that may itself suffer from selection bias, such as racial bias, when used for crime prediction.
3. Risk Factors Data: This analysis selects five point-level features that may be associated with shootings to build the model. See each dataset below:
Building Demolitions: A point feature class of demolitions that occurred in 2018, performed by private owners/contractors and by the Department of Licenses and Inspections due to dangerous building conditions.
Vacant Land: A point feature class of the location of properties across Philadelphia that are likely to be vacant land based on an assessment of City of Philadelphia administrative datasets.
Tobacco Retailer: A point feature class of the location of tobacco retailers that applied for a tobacco permit through an online application system in 2018.
Tobacco Youth Sales Violations: This dataset contains violations for tobacco sale to minors in 2018.
Affordable Housing: A point feature class of the affordable housing units built in or before 2018.
Census Data: demographic variables from the 2018 ACS for census tracts in the City of Philadelphia. This analysis mainly focuses on median household income, racial context, poverty level, and unemployment rate.
# polygon
phillypoliceDistricts <-
st_read("http://data-phl.opendata.arcgis.com/datasets/62ec63afb8824a15953399b1fa819df2_0.geojson") %>%
st_transform('ESRI:102728') %>%
dplyr::select(District = DIST_NUM)
phillyBoundary <-
st_read("http://data.phl.opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson") %>%
st_transform('ESRI:102728')
# 2018 Shooting Victims
shootings <-
st_read('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings&filename=shootings&format=geojson&skipfields=cartodb_id') %>%
filter(year == "2018") %>%
na.omit() %>%
st_transform('ESRI:102728')
# Creating a fishnet grid
philly_fishnet <-
st_make_grid(phillyBoundary,
cellsize = 500,
square = TRUE) %>%
.[phillyBoundary] %>%
st_sf() %>%
mutate(uniqueID = rownames(.))
phillyneigh <-
st_read("https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson") %>%
st_transform('ESRI:102728') %>%
st_transform(st_crs(philly_fishnet))
# import risk factors
building_demolition <-
st_read('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+demolitions&filename=demolitions&format=geojson&skipfields=cartodb_id') %>%
mutate(year = substr(start_date,1,4)) %>%
filter(year == '2018') %>%
st_transform('ESRI:102728') %>%
st_transform(st_crs(philly_fishnet)) %>%
mutate(Legend = "Building Demolition") %>%
dplyr::select(geometry, Legend)
vacant_land <-
st_read('https://opendata.arcgis.com/datasets/b990222a527849229b4192feb4c42dc0_0.geojson') %>%
filter(VACANT_FLAG == "Land") %>%
na.omit() %>%
st_transform('ESRI:102728') %>%
st_transform(st_crs(philly_fishnet)) %>%
mutate(Legend = "Vacant Land") %>%
dplyr::select(geometry, Legend)
tobacco <-
st_read('http://data-phl.opendata.arcgis.com/datasets/853a1421e738471b8cc0d6ff755d47ff_0.geojson') %>%
st_transform('ESRI:102728') %>%
st_transform(st_crs(philly_fishnet)) %>%
mutate(Legend = "Tobacco") %>%
dplyr::select(geometry, Legend) %>%
na.omit()
tobacco_violation <-
st_read('https://opendata.arcgis.com/datasets/25b43fb8cae84e8a89d74d8707bbb5f2_0.geojson') %>%
filter(COMP_CHK_Y == '2018')%>%
na.omit() %>%
st_transform('ESRI:102728') %>%
st_transform(st_crs(philly_fishnet)) %>%
mutate(Legend = "Tobacco Violation") %>%
dplyr::select(geometry, Legend)
affordable_housing <-
st_read('https://opendata.arcgis.com/datasets/ca8944190b604b2aae7eff17c8dd9ef5_0.geojson') %>%
filter(FISCAL_YEAR_COMPLETE <= "2018") %>%
st_transform('ESRI:102728') %>%
st_transform(st_crs(philly_fishnet)) %>%
mutate(Legend = "Affordable Housing") %>%
dplyr::select(geometry, Legend)
# Census data
# View(load_variables(2018,'acs5',cache = TRUE))
tracts18 <-
get_acs(geography = "tract", variables = c("B00001_001E","B02001_002E","B19013_001E","B25002_001E","B06012_002E","B27011_008E"),
year=2018, state=42, county=101, geometry=T, output="wide") %>%
st_transform('ESRI:102728') %>%
rename(TotalPop = B00001_001E,
Whites = B02001_002E,
MedHHInc = B19013_001E,
TotalUnit = B25002_001E,
TotalPoverty = B06012_002E,
TotalUnemployment = B27011_008E) %>%
dplyr::select(-NAME, -starts_with("B")) %>% # drop the remaining raw ACS columns
mutate(pctWhite = ifelse(TotalPop > 0, Whites / TotalPop * 100,0),
pctPoverty = ifelse(TotalPop > 0, TotalPoverty / TotalPop *100, 0),
pctUnemploy = ifelse(TotalPop > 0, TotalUnemployment / TotalPop *100, 0)
) %>%
dplyr::select(-Whites, -TotalPoverty ,-TotalUnemployment,-GEOID) %>%
st_transform(st_crs(philly_fishnet))
tracts18.MedHHInc <- tracts18 %>%
dplyr::select(MedHHInc) %>%
rename(Legend = MedHHInc)
tracts18.pctWhite <- tracts18 %>%
dplyr::select(pctWhite)%>%
rename(Legend = pctWhite)
tracts18.pctPoverty <- tracts18 %>%
dplyr::select(pctPoverty)%>%
rename(Legend = pctPoverty)
tracts18.pctUnemploy <- tracts18 %>%
dplyr::select(pctUnemploy)%>%
rename(Legend = pctUnemploy)
METHOD
This analysis uses open data from the City of Philadelphia's Open Data Portal and the US Census Bureau's American Community Survey. From these sources, I wrangle data and engineer features for risk factors involving income, education, age, and demographics. Using a fishnet grid, I break the city into square cells and design a model that predicts the risk of shooting occurrence in each cell based on its risk factor data.
Figures 1.1 and 1.2 below show the distribution of observed shootings in Philadelphia in 2018. Clearly, Oak Lane, Upper North, North, West, and Southwest Philadelphia suffer more from shootings than other neighborhoods. Since those neighborhoods are also known for poverty and underinvestment, a model trained on these data might flag areas as higher risk partly because of their social environment (a less-developed economy, concentrations of racial minorities, and so on), which is exactly where selection bias can creep in.
# 1. A map of shooting crime in 2018, Philadelphia
grid.arrange(ncol=2,
ggplot() +
geom_sf(data = phillyBoundary) +
geom_sf(data = shootings, colour="darkred", size=0.5, show.legend = "point") +
labs(title= "Observed Shooting Cases in 2018",
subtitle = 'Philadelphia, PA\n',
caption = 'Figure 1.1') +
mapTheme() +
plotTheme(),
ggplot() +
geom_sf(data = phillyBoundary, fill = "#E5E5E5") +
stat_density2d(data = data.frame(st_coordinates(shootings)),
aes(X, Y, fill = ..level.., alpha = ..level..),
size = 0.01, bins = 40, geom = 'polygon') +
scale_fill_viridis_c(option = "plasma") +
scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
labs(title = "Density of Observed Shooting Cases in 2018",
subtitle = 'Philadelphia, PA\n',
caption = 'Figure 1.2') +
mapTheme() +
theme(legend.position = "none") +
plotTheme())
Figure 2 presents the shooting data in the fishnet, the basic unit of analysis this study works at. Again, shootings are more common in North and West Philadelphia.
# Aggregate points to the fishnet
shooting_net <-
dplyr::select(shootings) %>%
mutate(countshootings = 1) %>%
aggregate(., philly_fishnet, sum) %>%
mutate(countshootings = replace_na(countshootings, 0),
uniqueID = rownames(.),
cvID = sample(round(nrow(philly_fishnet) / 24),
size=nrow(philly_fishnet), replace = TRUE))
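The aggregation step above joins each point to its containing grid cell and sums the counts. A tiny base-R analogue (synthetic points, 2-unit square cells, all values fabricated) shows the idea:

```r
# Synthetic points binned into a regular grid, mimicking the fishnet join
set.seed(11)
pts <- data.frame(x = runif(100, 0, 10), y = runif(100, 0, 10))
cell_id <- paste(ceiling(pts$x / 2), ceiling(pts$y / 2))  # 5 x 5 grid of 2-unit cells
cell_counts <- table(cell_id)
sum(cell_counts)  # every point falls in exactly one cell, so this is 100
```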
ggplot() +
geom_sf(data = shooting_net, aes(fill = countshootings), color = NA) +
scale_fill_viridis_c(option = "plasma",
name = 'Shooting Counts') +
labs(title = "Observed Shootings Joined to Fishnet, 2018",
subtitle = 'Philadelphia, PA\n',
caption = 'Figure 2') +
mapTheme() +
plotTheme()
To build a model that captures more of the shooting occurrence pattern, this analysis selects risk factors that previous studies have found to be more or less correlated with shootings: building demolitions within the City of Philadelphia, licensed tobacco retailers, affordable housing, tobacco sales violations to minors, vacant land, and demographic factors (household income, poverty level, unemployment rate, and racial composition) fetched from census data.
# All variables in fishnet
vars_net <-
rbind(building_demolition, tobacco, affordable_housing, tobacco_violation,
vacant_land) %>%
st_join(., philly_fishnet, join=st_within) %>%
st_drop_geometry() %>%
group_by(uniqueID, Legend) %>%
summarize(count = n()) %>%
full_join(philly_fishnet, by = "uniqueID") %>%
spread(Legend, count, fill=0) %>%
st_sf() %>%
na.omit() %>%
dplyr::select(-`<NA>`) %>%
ungroup()
### Multiple map for feature counts in fishnet
vars_net.long <-
gather(vars_net, Variable, value, -geometry, -uniqueID)
vars <- unique(vars_net.long$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis_c(option = "plasma",
name = " ") +
labs(title=i) +
mapTheme()}
do.call(grid.arrange,c(mapList, ncol =2, top = "Figure 3.1 Risk Factors by Fishnet\n"))
## 3.2 Nearest Neighbor Feature
# convenience to reduce length of function names.
st_c <- st_coordinates
st_coid <- st_centroid
## create nearest neighbor (NN) features from each risk factor, k = 3
'%!in%' <- function(x,y)!('%in%'(x,y))
vars_net$tobacco.nn <-
nn_function(st_c(st_coid(vars_net)),
st_c(tobacco),
k = 3)
vars_net$building_demolition.nn <-
nn_function(st_c(st_coid(vars_net)),
st_c(building_demolition),
k = 3)
vars_net$affordable_housing.nn <-
nn_function(st_c(st_coid(vars_net)),
st_c(affordable_housing %>%
filter(geometry %!in% 'c(NaN, NaN)')),
k = 3)
vars_net$tobacco_violation.nn <-
nn_function(st_c(st_coid(vars_net)),
st_c(tobacco_violation),
k = 3)
vars_net$vacant_land.nn <-
nn_function(st_c(st_coid(vars_net)),
st_c(vacant_land),
k = 3)
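As a sanity check on the nearest-neighbor logic, here is a minimal base-R equivalent of what `nn_function` computes: for each "from" point, the mean Euclidean distance to its k nearest "to" points (all coordinates below are made up):

```r
avg_nn_dist <- function(from, to, k) {
  apply(from, 1, function(p) {
    d <- sqrt(colSums((t(to) - p)^2))  # distances from p to every "to" point
    mean(sort(d)[seq_len(k)])          # mean of the k smallest
  })
}
from_pts <- matrix(c(0, 0, 10, 10), ncol = 2, byrow = TRUE)
to_pts   <- matrix(c(1, 0, 0, 2, 10, 11), ncol = 2, byrow = TRUE)
avg_nn_dist(from_pts, to_pts, k = 2)  # first value: mean(1, 2) = 1.5
```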
## Visualize the nearest three features
vars_net.long.nn <-
dplyr::select(vars_net, ends_with(".nn")) %>%
gather(Variable, value, -geometry)
vars <- unique(vars_net.long.nn$Variable)
mapList <- list()
for(i in vars){
mapList[[i]] <-
ggplot() +
geom_sf(data = filter(vars_net.long.nn, Variable == i), aes(fill=value), colour=NA) +
scale_fill_viridis_c(option = "plasma",
name = " ") +
labs(title=i) +
mapTheme() +
plotTheme()}
do.call(grid.arrange,c(mapList, ncol = 2, top = "Figure 3.2 Nearest Neighbor Risk Factors by Fishnet\n"))
# IV and DVs all in fishnet
philly_final_net <-
left_join(shooting_net, st_drop_geometry(vars_net), by="uniqueID")
philly_final_net <-
st_centroid(philly_final_net) %>%
st_join(dplyr::select(phillyneigh, mapname)) %>%
st_join(dplyr::select(phillypoliceDistricts, District)) %>%
st_drop_geometry() %>%
left_join(dplyr::select(philly_final_net, geometry, uniqueID)) %>%
st_sf() %>%
na.omit()
Now with the complete set of risk factor features, we explore the correlation between the features and observed shootings. Figure 5 below visualizes these correlations. As expected, all the selected features are positively correlated with shooting counts.
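The correlations in Figure 5 are plain Pearson correlations; on synthetic count data (all values here fabricated), the calculation reduces to `cor()`:

```r
set.seed(1)
risk_count  <- rpois(100, 3)                              # synthetic risk-factor counts per cell
shoot_count <- rpois(100, pmax(risk_count - 1, 0) + 0.5)  # counts loosely tied to the risk factor
round(cor(risk_count, shoot_count), 2)                    # positive, as in Figure 5
```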
# 5. A small multiple scatterplot with correlations
philly.correlation.long <-
st_drop_geometry(philly_final_net) %>%
dplyr::select(-uniqueID, -cvID,-starts_with('mapname'),-starts_with('District')) %>%
gather(Variable, Value, -countshootings)
philly.correlation.cor <-
philly.correlation.long %>%
group_by(Variable) %>%
summarize(philly.correlation = cor(Value, countshootings, use = "complete.obs"))
ggplot(philly.correlation.long, aes(Value, countshootings)) +
geom_point(size = 0.1) +
geom_text(data = philly.correlation.cor, aes(label = paste("r =", round(philly.correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
geom_smooth(method = "lm", se = FALSE, colour = "black") +
facet_wrap(~Variable, ncol = 4, scales = "free") +
labs(title = "Observed Shooting Crime as a Function of Risk Factors",
subtitle = "Philadelphia, PA\n",
caption = "Figure 5")+
plotTheme()
Given that shootings are relatively rare events, it is reasonable for most grid cells to contain no shootings. Based on the histogram below, the dependent variable follows a Poisson distribution, indicating a Poisson model should be considered for model building.
# 6. A histogram of dependent variable
ggplot(data = philly_final_net) +
geom_histogram(aes(x = countshootings), fill = 'orange') +
labs(title="Histogram of Dependent Variable: Count of Observed Shootings",
subtitle = "Philadelphia, PA\n",
caption = "Figure 6") +
xlab('Count of Shootings') +
ylab('Count') +
plotTheme()
For each risk factor, I use the feature counts in each fishnet cell, along with distance to shooting hotspots and distance to significant shooting hotspots, as model features, chosen to avoid collinearity. I also add Local Moran's I spatial process features based on the result in Question 4. Therefore, there are eleven features in total.
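Since the dependent variable is a count, the regressions can be fit with `glm(..., family = poisson)`. A minimal self-contained sketch on synthetic data (the names and coefficients here are made up, not the actual model):

```r
set.seed(42)
toy <- data.frame(x = runif(200))
toy$y <- rpois(200, lambda = exp(0.2 + 1.5 * toy$x))  # counts generated from a Poisson process
m <- glm(y ~ x, family = poisson, data = toy)
coef(m)  # estimates should land near the true values (0.2, 1.5)
```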
Goodness of fit metrics are generated for four regressions: two include Just Risk Factors (reg.vars), and two (reg.ss.vars) include the risk factors plus the Local Moran's I spatial process features created above.
A well-generalized crime prediction model learns the crime risk "experience" at both citywide and local spatial scales. The best way to test this is to hold out one local area, train the model on the remaining n - 1 areas, predict for the hold-out, and record the goodness of fit: the cross-validation approach called leave-one-group-out (LOGO-CV) used in this analysis. reg.ss.cv performs random k-fold cross-validation using the spatial process features, while reg.ss.spatialCV performs LOGO-CV (spatial cross-validation grouped by neighborhood name) using the same features. The same procedures are also run on the Just Risk Factors feature set.
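The LOGO-CV procedure described above can be sketched in a few lines of base R (toy data; `hood` stands in for neighborhood, and the helper name is hypothetical):

```r
# Hold out each group in turn, train on the rest, record MAE on the hold-out
logo_cv_mae <- function(df, group) {
  sapply(unique(df[[group]]), function(g) {
    train <- df[df[[group]] != g, ]
    test  <- df[df[[group]] == g, ]
    m <- glm(y ~ x, family = poisson, data = train)
    mean(abs(test$y - predict(m, test, type = "response")))
  })
}
set.seed(7)
toy <- data.frame(y = rpois(90, 2), x = runif(90),
                  hood = rep(c("North", "West", "South"), each = 30))
logo_cv_mae(toy, "hood")  # one MAE per held-out neighborhood
```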
Figure 7.1 below displays the distribution of MAE across all models. The errors are small and tightly clustered, indicating the models generalize reasonably well across the city.
#### 7.1 Distribution of MAE
error_by_reg_and_fold %>%
ggplot(aes(MAE)) +
geom_histogram(bins = 30, colour="black", fill = "#5dccb9") +
facet_wrap(~Regression) +
geom_vline(xintercept = 0) +
scale_x_continuous(breaks = seq(0, 8, by = 1)) +
labs(title="Distribution of MAE",
subtitle = "k-fold cross validation vs. LOGO-CV\n",
caption = "Figure 7.1",
x="Mean Absolute Error", y="Count") +
plotTheme()
Figure 7.2 provides additional information on the errors across models.
error_by_reg_and_fold %>%
ggplot() +
geom_sf(aes(fill = MAE)) +
facet_wrap(~Regression) +
scale_fill_viridis_c(option = "plasma") +
labs(title = "Shooting Errors by Random k-fold and LOGO-CV Regression\n",
caption = 'Figure 7.2') +
mapTheme() +
theme(legend.position="bottom") +
plotTheme()
Interestingly, Figure 7.3 below shows that all models over-predict in low shooting-rate areas and under-predict in hotspot areas. Over-prediction in lower-shooting areas may highlight areas of latent risk; under-prediction in higher-shooting areas reflects the difficulty of predicting hotspots.
st_drop_geometry(reg.summary) %>%
group_by(Regression) %>%
mutate(shooting_Decile = ntile(countshootings, 10)) %>%
group_by(Regression, shooting_Decile) %>%
summarize(meanObserved = mean(countshootings, na.rm=T),
meanPrediction = mean(Prediction, na.rm=T)) %>%
gather(Variable, Value, -Regression, -shooting_Decile) %>%
ggplot(aes(shooting_Decile, Value, shape = Variable)) +
geom_point(size = 2) +
geom_path(aes(group = shooting_Decile), colour = "black") +
scale_shape_manual(values = c(2, 17)) +
facet_wrap(~Regression) + xlim(0,10) +
labs(title = "Predicted and Observed Shootings by Observed Shooting Decile\n",
subtitle = "Philadelphia, PA\n",
caption = 'Figure 7.3',
x = 'Shooting Decile') +
plotTheme()
Table 8 below confirms my assumption that the Spatial Process features improve the model. The model appears consistently robust even under conservative LOGO-CV. For reference, the average observed shooting count per cell is 0.0837, while the average predicted count per cell is 0.0856.
st_drop_geometry(error_by_reg_and_fold) %>%
group_by(Regression) %>%
summarize(Mean_MAE = round(mean(MAE), 2),
SD_MAE = round(sd(MAE), 2)) %>%
kable(caption = "Table 8. MAE by regression") %>%
kable_styling("striped", full_width = F)
| Regression | Mean_MAE | SD_MAE |
|---|---|---|
| Random k-fold CV: Just Risk Factors | 0.07 | 0.11 |
| Random k-fold CV: Spatial Process | 0.05 | 0.04 |
| Spatial LOGO-CV: Just Risk Factors | 0.11 | 0.13 |
| Spatial LOGO-CV: Spatial Process | 0.04 | 0.05 |
To be generalizable, the models must produce errors of similar magnitude in both majority-white and majority-non-white neighborhoods. However, as shown in Table 9, the mean error is negative in Majority_Non_White neighborhoods and positive in Majority_White neighborhoods. In other words, each model on average under-predicts in Majority_Non_White neighborhoods and over-predicts in Majority_White neighborhoods. It appears this algorithm does not generalize well with respect to race.
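Figure 9 and Table 9 rely on a `tracts18.race` object classifying tracts by majority race; its construction is not shown above. A hedged reconstruction from `pctWhite` (a 50% threshold is assumed), demonstrated here on a toy data frame rather than the real tracts:

```r
# Toy stand-in for tracts18; in the analysis this rule would apply to tracts18$pctWhite
toy_tracts <- data.frame(GEOID = c("t1", "t2", "t3"), pctWhite = c(72, 35, 50))
toy_tracts$raceContext <- ifelse(toy_tracts$pctWhite > 50,
                                 "Majority_White", "Majority_Non_White")
toy_tracts$raceContext
```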
ggplot() +
geom_sf(data = na.omit(tracts18.race), aes(fill = raceContext)) +
scale_fill_manual(values = c("#10cbaf", "#ff9966"), name="Race Context") +
labs(title = "Race Context",
subtitle = "Philadelphia, PA\n",
caption = 'Figure 9') +
mapTheme() +
theme(legend.position="bottom") +
plotTheme()
reg.summary %>%
filter(str_detect(Regression, "LOGO")) %>%
st_centroid() %>%
st_join(tracts18.race) %>%
na.omit() %>%
st_drop_geometry() %>%
group_by(Regression, raceContext) %>%
summarize(mean.Error = mean(Error, na.rm = T)) %>%
spread(raceContext, mean.Error) %>%
kable(caption = "Table 9. Mean Error by Neighborhood Racial Context") %>%
kable_styling("striped", full_width = F)
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Spatial LOGO-CV: Just Risk Factors | -0.1406815 | 0.0044104 |
| Spatial LOGO-CV: Spatial Process | -0.0314771 | 0.0006632 |
In this last section, the analysis asks whether the predictive policing model predicts shootings better than a traditional approach, the kernel density model.
Comparing kernel density maps with different search radii (1000 ft, 1500 ft, and 2000 ft), this analysis selects the 1000 ft search radius, since it best displays the density of shooting cases (see Figure 10.1).
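The search radius choice is the usual kernel bandwidth trade-off; a one-dimensional base-R analogue with `density()` makes the effect visible (synthetic bimodal data, not the shooting points):

```r
set.seed(3)
pts <- c(rnorm(50, 0), rnorm(50, 6))  # two clusters of events
narrow <- density(pts, bw = 0.3)      # analogous to a small search radius
wide   <- density(pts, bw = 3)        # analogous to a large search radius
modes  <- function(d) sum(diff(sign(diff(d$y))) == -2)  # count local maxima
c(narrow = modes(narrow), wide = modes(wide))  # the wide bandwidth smooths clusters together
```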
Next, a comparison map of the risk categories for both model types is generated, with a sample of 2019 shooting points overlaid. A well-fit model should show the highest risk category uniquely targeted to places with a high density of shooting points. As Figure 10.2 indicates, although the traditional model captures the 2019 shooting risk pattern to some degree, the predictive policing model is noticeably more accurate and targeted.
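The comparison below uses a `shootings19` object (observed 2019 shootings) whose loading is not shown in the code above. A hedged reconstruction, assumed to mirror the 2018 query against the same Carto endpoint:

```r
# Assumed reconstruction: same endpoint as the 2018 load, filtered to 2019
shootings19 <-
  st_read('https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings&filename=shootings&format=geojson&skipfields=cartodb_id') %>%
  filter(year == "2019") %>%
  na.omit() %>%
  st_transform('ESRI:102728')
```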
### 10.1 Make Kernel Density Map
sho_ppp <- as.ppp(st_coordinates(shootings), W = st_bbox(philly_final_net))
sho_KD.1000 <- spatstat::density.ppp(sho_ppp, 1000)
sho_KD.1500 <- spatstat::density.ppp(sho_ppp, 1500)
sho_KD.2000 <- spatstat::density.ppp(sho_ppp, 2000)
sho_KD.df <- rbind(
mutate(data.frame(rasterToPoints(mask(raster(sho_KD.1000), as(phillyneigh, 'Spatial')))), Legend = "1000 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(sho_KD.1500), as(phillyneigh, 'Spatial')))), Legend = "1500 Ft."),
mutate(data.frame(rasterToPoints(mask(raster(sho_KD.2000), as(phillyneigh, 'Spatial')))), Legend = "2000 Ft."))
sho_KD.df$Legend <- factor(sho_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))
ggplot(data = sho_KD.df, aes(x = x, y = y)) +
geom_raster(aes(fill=layer)) +
facet_wrap(~Legend) +
coord_sf(crs=st_crs(philly_final_net)) +
scale_fill_viridis_c(option = "plasma",
name = "Density") +
labs(title = "Kernel Density with 3 Different Search Radius Scales",
subtitle = "Shooting Cases, 2018, Philadelphia, PA\n",
caption = 'Figure 10.1') +
mapTheme() +
plotTheme()
sho_KDE_sf <- as.data.frame(sho_KD.1000) %>%
st_as_sf(coords = c("x", "y"), crs = st_crs(philly_final_net)) %>%
aggregate(., philly_final_net, mean) %>%
mutate(label = "Kernel Density",
Risk_Category = ntile(value, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(shootings19) %>% mutate(shootingCount = 1), ., sum) %>%
mutate(shootingCount = replace_na(shootingCount, 0))) %>%
dplyr::select(label, Risk_Category, shootingCount)
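The percentile-to-category recoding above (reused below for the risk predictions) can be sanity-checked with a small base-R equivalent (`risk_label` is a hypothetical helper, not from the analysis; it matches the `case_when` thresholds on integer percentiles):

```r
risk_label <- function(pct) {
  ifelse(pct >= 90, "90% to 100%",
  ifelse(pct >= 70, "70% to 89%",
  ifelse(pct >= 50, "50% to 69%",
  ifelse(pct >= 30, "30% to 49%", "1% to 29%"))))
}
risk_label(c(95, 70, 69, 30, 1))  # one label per percentile, highest band first
```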
##### 10.4 Prediction by Risk Prediction Model
sho_risk_sf <-
reg.ss.spatialCV %>%
mutate(label = "Risk Predictions",
Risk_Category = ntile(Prediction, 100),
Risk_Category = case_when(
Risk_Category >= 90 ~ "90% to 100%",
Risk_Category >= 70 & Risk_Category <= 89 ~ "70% to 89%",
Risk_Category >= 50 & Risk_Category <= 69 ~ "50% to 69%",
Risk_Category >= 30 & Risk_Category <= 49 ~ "30% to 49%",
Risk_Category >= 1 & Risk_Category <= 29 ~ "1% to 29%")) %>%
cbind(
aggregate(
dplyr::select(shootings19) %>% mutate(shootingCount = 1), ., sum) %>%
mutate(shootingCount = replace_na(shootingCount, 0))) %>%
dplyr::select(label,Risk_Category, shootingCount)
rbind(sho_KDE_sf, sho_risk_sf) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
ggplot() +
geom_sf(aes(fill = Risk_Category), colour = NA) +
geom_sf(data = sample_n(shootings19, 1130), size = .5, colour = "black") +
facet_wrap(~label, ) +
scale_fill_viridis_d(option = "plasma",
name = 'Risk Category') +
labs(title="Comparison of Kernel Density and Risk Predictions",
subtitle="Bottom layer is 2019 predicted shooting counts.\nDot is observed 2019 shooting counts.\n",
caption = 'Figure 10.2') +
mapTheme() +
plotTheme()
However, as Figure 11 indicates, the advantage of the predictive policing model is not consistent across all risk categories. For risk categories below 90%, the traditional kernel density model is actually more accurate; for the highest category, the predictive policing model is more accurate. This suggests the predictive policing model built here is better at locating shooting risk hotspots, again confirming the earlier conclusion.
rbind(sho_KDE_sf, sho_risk_sf) %>%
st_set_geometry(NULL) %>%
na.omit() %>%
gather(Variable, Value, -label, -Risk_Category) %>%
group_by(label, Risk_Category) %>%
summarize(countShootings = sum(Value)) %>%
ungroup() %>%
group_by(label) %>%
mutate(Rate_of_test_set_crimes = countShootings / sum(countShootings)) %>%
ggplot(aes(Risk_Category,Rate_of_test_set_crimes)) +
geom_bar(aes(fill=label), position="dodge", stat="identity") +
scale_fill_viridis_d(option = "plasma",
name = 'Model') +
labs(title = "Risk Prediction vs. Kernel Density, 2019 Shooting Crime",
subtitle = 'Philadelphia, PA',
caption = 'Figure 11',
x = 'Risk Category',
y = 'Rate of Test Set Shooting Crimes') +
plotTheme()
Generally speaking, this model achieves the goal of identifying shooting risk hotspots in Philadelphia: it predicts hotspots more accurately than the traditional kernel density model, and it identifies areas with latent shooting risk where no shootings have yet been observed. These findings are significant.
One limitation of this model is that it fails to avoid selection bias: it systematically under-predicts shootings in areas with many observed shootings and in majority-non-white neighborhoods. This would negatively affect how police resources are allocated.
Another limitation concerns the choice of dependent variable, shooting crime. There are subsets of shootings that raise further questions about how shooting cases correlate with other risk factors and how spatial patterns play a role. For instance, a drive-by shooting, a subset of gun violence in which someone fires a gun from a vehicle at another vehicle, a person, a structure, or another stationary object, is commonly neglected but carries a comparatively high fatality rate. Hence, it is too early to conclude that either model is consistently better than the other, since many factors and scenarios remain to be considered.
Therefore, despite comparatively good accuracy, any attempt to apply this model should be cautious about its limitations.